Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published
during the past two centuries in seven different languages. For all languages and chronological subsets of 
the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more 
common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling 
relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing 
marginal need for new words, a feature that is likely related to the underlying correlations between words. We
calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases,
indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the
basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.



Books in libraries and attics around the world constitute 
an immense "crowd-sourced" historical record that traces the 
evolution of culture back beyond the limits of oral history. 
However, the disaggregation of written language into individual
books makes the longitudinal analysis of language a difficult
open problem. To this end, the book digitization project at
Google Inc. presents a monumental step forward, providing an
enormous, publicly accessible collection of written language
in the form of the Google Books Ngram Viewer web application [1].
Approximately 4% of all books ever published have
been scanned, making available over $10^7$ occurrence time series
(word-use trajectories) that archive cultural dynamics in
seven different languages over a period of more than two centuries.
This dataset highlights the utility of open "Big Data,"
which is the gateway to "metaknowledge" [2], the knowledge
about knowledge. A digital data deluge is sustaining extensive 
interdisciplinary research efforts towards quantitative insights 
into the social and natural sciences S0. 

"Culturomics," the use of high-throughput data for the pur- 
pose of studying human culture, is a promising new empirical 
platform for gaining insight into subjects ranging from political
history to epidemiology [8]. As first demonstrated by
Michel et al. [8], the Google n-gram dataset is well-suited for
examining the microscopic properties of an entire language
ecosystem. Using this dataset to analyze the growth patterns
of individual word frequencies, Petersen et al. [9] recently
identified tipping points in the life trajectory of new words,
statistical patterns that govern the fluctuations in word use,
and quantitative measures for cultural memory. The statistical
properties of cultural memory, derived from the quantitative
analysis of individual word-use trajectories, were also investigated
by Gao et al. [10], who found that words describing
social phenomena tend to have different long-range correlations
than words describing natural phenomena.

Here we study the growth and evolution of written language 
by analyzing the macroscopic scaling patterns that characterize
word use. Using the Google 1-gram data collected at the
1-year time resolution over the period 1800-2008, we quantify
the annual fluctuation scale of words within a given corpus
and show that languages can be said to "cool by expansion."
This effect constitutes a dynamic law, in contrast to the
static laws of Zipf and Heaps, which are founded upon snapshots
of single texts. The Zipf law [11-17], quantifying the
distribution of word frequencies, and the Heaps law [13, 18-20],
relating the size of a corpus to the vocabulary size of that
corpus, are classic paradigms that capture many complexities
of language in remarkably simple statistical patterns. While
these laws have been exhaustively tested on relatively small 
snapshots of empirical data, here we test the validity of these 
laws using extremely large corpora. 

Interestingly, we observe two scaling regimes in the probability
density functions of word usage, with the Zipf law holding
only for the set of more frequently used words, referred
to as the "kernel lexicon" by Ferrer i Cancho et al. [14]. The
word frequency distribution for the rarely used words constituting
the "unlimited lexicon" [14] obeys a distinct scaling
law, suggesting that rare words belong to a distinct class. This
"unlimited lexicon" is populated by highly technical words, 
new words, numbers, spelling variants of kernel words, and 
optical character recognition (OCR) errors. 

FIG. 1: Two-regime scaling distribution of word frequency. The kink
in the probability density functions $P(f)$ occurs around $f_\times \approx 10^{-5}$
for each corpus analyzed (see legend). (A,B) Data from all years are
aggregated into a single distribution. (C,D) $P(f)$ comprising data
from only year $t = 2000$, providing evidence that the distribution is
stable even over shorter time frames and likely emerges in corpora
that are sufficiently large to be comprehensive of the language studied.
For details concerning the scaling exponents we refer to Table I
and the main text.

Many new words start in relative obscurity, and their eventual
importance can be underappreciated on the basis of their initial
frequency. This fact is closely related to the information cost of
introducing new words and concepts. For single topical texts,
Heaps observed that the vocabulary size exhibits sub-linear
growth with document size [18]. Extending this concept to entire
corpora, we find a scaling relation that indicates a decreasing
"marginal need" for new words, which are the manifestation
of cultural evolution and the seeds for language growth.
We introduce a pruning method to study the role of infrequent 
words on the allometric scaling properties of language. By 
studying progressively smaller sets of the kernel lexicon we 
can better understand the marginal utility of the core words. 
The pattern that arises for all languages analyzed provides in- 
sight into the intrinsic dependency structure between words. 

The correlations in word use can also be author and topic
dependent. Bernhardsson et al. recently introduced the
"metabook" concept [19, 20], according to which word-frequency
structures are author-specific: the word-frequency
characteristics of a random excerpt from a compilation of everything
that a specific author could ever conceivably write
(his/her "metabook") should accurately match those of the author's
actual writings. It is not immediately obvious whether
a compilation of all the metabooks of all authors would still
conform to the Zipf law and the Heaps law. The immense
size and time span of the Google n-gram dataset allow us to
examine this question in detail.



Results 

Longitudinal analysis of written language. Allometric scaling
analysis [21] is used to quantify the role of system size on
general phenomena characterizing a system, and has been applied
to systems as diverse as the metabolic rate of mitochondria [22]
and city growth [23-29]. Indeed, city growth shares
two common features with the growth of written text: (i) the
Zipf law is able to describe the distribution of city sizes regardless
of country or the time period of the data [26], and (ii)
city growth has inherent constraints due to geography, changing
labor markets and their effects on opportunities for innovation
and wealth creation [27, 28], just as vocabulary growth
is constrained by human brain capacity and the varying utilities
of new words across users [14].

FIG. 2: Allometric scaling of language. Scatter plots of the output
corpus size $N_u$ given the empirical vocabulary size $N_w$ using
all data ($U_c = 0$) over the 209-year period 1800-2008. Shown is the
OLS estimate of the exponent $b$ quantifying the Heaps law relation
$N_w \sim [N_u]^b$.

We construct a word counting framework by first defining
the quantity $u_i(t)$ as the number of times word $i$ is used in year
$t$. Since the number of books and the number of distinct words
grow dramatically over time, we define the relative word use,
$f_i(t)$, as the fraction of the total body of text occupied by word
$i$ in the same year,

$f_i(t) = u_i(t) / N_u(t)$,   (1)

where the quantity $N_u(t) = \sum_{i=1}^{N_w(t)} u_i(t)$ is the total number
of indistinct word uses, while $N_w(t)$ is the total number of
distinct words digitized from books printed in year $t$. Both
$N_w$ ("types," giving the vocabulary size) and $N_u$
("tokens," giving the size of the body of text) are generally
increasing over time.
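
As a minimal illustration of this counting framework, the following Python sketch computes $u_i(t)$, $f_i(t)$, $N_u(t)$ and $N_w(t)$ for the text of a single year. The function name `word_use_stats` and the toy sentence are assumptions made for illustration; they are not part of the Google n-gram data or of the original analysis.

```python
from collections import Counter


def word_use_stats(tokens):
    """Return (u, f, n_u, n_w) for the words printed in one year.

    u   : dict word -> number of uses, u_i(t)
    f   : dict word -> relative use, f_i(t) = u_i(t) / N_u(t)
    n_u : total number of word uses ("tokens"), N_u(t)
    n_w : number of distinct words ("types"), N_w(t)
    """
    u = Counter(tokens)
    n_u = sum(u.values())
    n_w = len(u)
    f = {word: count / n_u for word, count in u.items()}
    return u, f, n_u, n_w


# Toy "corpus" for one year (hypothetical data).
u, f, n_u, n_w = word_use_stats("the cat saw the dog and the bird".split())
print(n_u, n_w, f["the"])  # 8 tokens, 6 types, "the" has relative use 3/8
```

By construction the relative uses sum to one within a year, which is what allows years with very different corpus sizes to be compared.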

FIG. 3: Pruning reveals the variable marginal return of words. The
Heaps scaling exponent $b$ depends on the extent of the inclusion of
the rarest words. For a given corpus and $U_c$ value we make a scatter
plot between $N_w(t|U_c)$ and $N_u(t|U_c)$ using words with $u_i(t) > U_c$.
(Panel Inset) We use OLS estimation to estimate the scaling exponent
$b(U_c)$ for the model $N_w(t|U_c) \sim [N_u(t|U_c)]^b$ to show that $b(U_c)$
increases from approximately 0.5 towards unity as we prune the corpora
of extremely rare words. Our longitudinal language analysis
provides insight into the structural importance of the most frequent
words, which are used more times per appearance and which play a
crucial role in the usage of new and rare words.

FIG. 4: Pruning reveals the variable marginal return of words. The
Heaps scaling exponent $b$ depends on the extent of the inclusion of
the rarest words. For a given corpus and $U_c$ value we make a scatter
plot between $N_w(t|U_c)$ and $N_u(t|U_c)$ using words with $u_i(t) > U_c$,
using the same data color-$U_c$ correspondence as in Fig. 5. (Panel Inset)
We use OLS estimation to estimate the scaling exponent $b(U_c)$
for the model $N_w(t|U_c) \sim [N_u(t|U_c)]^b$ to show that $b(U_c)$ increases
from approximately 0.5 towards unity as we prune the corpora of extremely
rare words. Our longitudinal language analysis provides insight
into the structural importance of the most frequent words, which
are used more times per appearance and which play a crucial role in
the usage of new and rare words.

The Zipf law and the two scaling regimes. Zipf investigated
a number of bodies of literature and observed that the
frequency of any given word is roughly inversely proportional
to its rank [11], with the frequency of the $z$-ranked word given
by the relation

$f(z) \sim z^{-\zeta}$   (2)

with a scaling exponent $\zeta \approx 1$. This empirical law has
been confirmed for a broad range of data, ranging from income
rankings, city populations, and the varying sizes of
avalanches, forest fires [30] and firm size [31] to the linguistic
features of noncoding DNA [32]. The Zipf law can be derived
through the "principle of least effort," which minimizes
the communication noise between speakers (writers) and listeners
(readers) [16]. The Zipf law has been found to hold for
a large dataset of English text [14], but there are interesting deviations
observed in the lexicon of individuals diagnosed with
schizophrenia [15]. Here, we also find statistical regularity in
the distribution of relative word use for 11 different datasets,
each comprising more than half a million distinct words taken
from millions of books [8].
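
The rank-frequency relation in Eq. (2) can be checked on any list of word counts by regressing $\ln f(z)$ on $\ln z$. The sketch below is only an illustration under stated assumptions: the helper names `ols_slope` and `zipf_exponent` are hypothetical, the counts are synthetic, and a plain least-squares fit over all ranks is used here for simplicity rather than as a claim about how the exponents in this paper were obtained.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def zipf_exponent(counts):
    """Estimate zeta in f(z) ~ z^(-zeta) from positive word counts."""
    ordered = sorted(counts, reverse=True)       # rank 1 = most frequent word
    total = sum(ordered)
    log_rank = [math.log(z) for z in range(1, len(ordered) + 1)]
    log_freq = [math.log(c / total) for c in ordered]
    return -ols_slope(log_rank, log_freq)


# Synthetic counts that follow f(z) ~ 1/z exactly (hypothetical data).
synthetic_counts = [1000.0 / z for z in range(1, 201)]
zeta = zipf_exponent(synthetic_counts)
print(round(zeta, 6))
```

On real corpora the fit would need to be restricted to the more frequent words, since the rarer words follow a different scaling regime.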

Figure 1 shows the probability density functions $P(f)$ resulting
from data aggregated over all the years (A,B) as well
as over 1-year periods, as demonstrated for the year $t = 2000$
(C,D). Regardless of the language and the considered time
span, the probability density functions are characterized by
a striking two-regime scaling, which was first noted by Ferrer
i Cancho and Solé [14], and can be quantified as

$P(f) \sim \begin{cases} f^{-\alpha_-}, & \text{if } f < f_\times \text{ ["unlimited lexicon"]} \\ f^{-\alpha_+}, & \text{if } f > f_\times \text{ ["kernel lexicon"]} \end{cases}$   (3)

These two regimes, designated "kernel lexicon" and "unlimited
lexicon," are thought to reflect the cognitive constraints
of the brain's finite vocabulary [14]. The specialized words
found in the unlimited lexicon are not universally shared and
are used significantly less frequently than the words in the kernel
lexicon. This is reflected in the kink in the probability
density functions and gives rise to the anomalous two-scaling
distribution shown in Fig. 1.

TABLE I: Summary of the scaling exponents characterizing the Zipf law and the Heaps law. To calculate $\sigma_r(t|f_c)$ (see Figs. 6 and 7) we
use only the relatively common words that meet the criterion that their average word use $\langle f_i \rangle$ over the entire word history is larger than a
threshold $f_c = 10/\mathrm{Min}[N_u(t)]$, listed in the first column for each corpus. The $b$ values shown are calculated using all words ($U_c = 0$). The
"unlimited lexicon" scaling exponent $\alpha_-(t)$ is calculated for $10^{-8} < f < 10^{-6}$ and the "kernel lexicon" exponent $\alpha_+(t)$ is calculated for
$10^{-4} < f < 10^{-1}$ using the maximum likelihood estimator method for each year. The average and standard deviation ($\langle \cdots \rangle \pm \sigma$) listed are
computed using the $\alpha_+(t)$ and $\alpha_-(t)$ values over the 209-year period 1800-2008 (except for Chinese, which is calculated from 1950-2008
data). We show the Zipf scaling exponent calculated as $\zeta = 1/(\langle \alpha_+ \rangle - 1)$. The last column indicates the $\beta$ scaling exponents from Fig. 7(A).

The exponent $\alpha_+$ and the corresponding rank-frequency
scaling exponent $\zeta$ in Eq. (2) are related asymptotically by

$\alpha_+ \approx 1 + 1/\zeta$   (4)

with no analogous relationship for the unlimited lexicon
values $\alpha_-$ and $\zeta_-$. Table I lists the average $\alpha_+$ and $\alpha_-$
values calculated by aggregating $\alpha_\pm$ values for each year
using a maximum likelihood estimator for the power-law
distribution [33]. We characterize the two scaling regimes
using a crossover region around $f_\times \approx 10^{-5}$ to distinguish
between $\alpha_-$ and $\alpha_+$: (i) $10^{-8} < f < 10^{-6}$ corresponds to
$\alpha_-$ and (ii) $10^{-4} < f < 10^{-1}$ corresponds to $\alpha_+$. For the
words that satisfy $f > f_\times$ that comprise the kernel lexicon,
we verify the Zipf scaling law $\zeta \approx 1$ (corresponding to
$\alpha_+ \approx 2$) for all corpora analyzed. For the unlimited lexicon
regime $f < f_\times$, however, the Zipf law is not obeyed, as we
find $\alpha_- \approx 1.7$. Note that $\alpha_-$ is significantly smaller in the
Hebrew, Chinese, and the Russian corpora, which suggests
that a more generalized version of the Zipf law [14] may be
needed, one which is slightly language-dependent, especially
when taking into account the usage of specialized words from
the unlimited lexicon.
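
To make the estimation step concrete, the sketch below applies the standard continuous maximum likelihood estimator for a power-law density, $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$, and then converts the result into a Zipf exponent through Eq. (4). It is a simplified illustration under assumptions: the function names are hypothetical, the sample is synthetic, and only a lower cutoff $x_{\min}$ is imposed, whereas the exponents reported here are computed inside bounded frequency windows.

```python
import math
import random


def powerlaw_mle(values, x_min):
    """Continuous MLE of alpha for P(x) ~ x^(-alpha) on x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def zipf_from_alpha(alpha):
    """Invert Eq. (4): alpha = 1 + 1/zeta  =>  zeta = 1 / (alpha - 1)."""
    return 1.0 / (alpha - 1.0)


# Synthetic sample drawn from a pure power law with alpha = 2 by
# inverse-transform sampling (hypothetical data, fixed seed).
rng = random.Random(0)
x_min, alpha_true = 1e-4, 2.0
sample = [
    x_min * (1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0))
    for _ in range(20000)
]

alpha_hat = powerlaw_mle(sample, x_min)
zeta_hat = zipf_from_alpha(alpha_hat)
print(round(alpha_hat, 2), round(zeta_hat, 2))
```

With an upper cutoff, as in the windows $10^{-8} < f < 10^{-6}$ and $10^{-4} < f < 10^{-1}$, the likelihood has to be renormalized over the window, so this closed form should be read as an approximation only.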

The Heaps law and the increasing marginal returns of new
words. Heaps observed that vocabulary size, i.e. the number
of distinct words, exhibits a sub-linear growth with document
size [18]. This observation has important implications for the
"return on investment" of a new word as it is established and
becomes disseminated throughout the literature of a given language.
As a proxy for this return, Heaps studied how often
new words are invoked in lieu of preexisting competitors and
examined the linguistic value of new words and ideas by analyzing
the relation between the total number of words printed
in a body of text, $N_u$, and the number of these which are distinct,
$N_w$, i.e. the vocabulary size [18]. The marginal return
of new words, $dN_u/dN_w$, quantifies the impact of the addition
of a single word to the vocabulary of a corpus on the aggregate
output (corpus size).

For individual books, the empirically observed scaling relation
between $N_u$ and $N_w$ obeys

$N_w \sim (N_u)^b$,   (5)

with $b < 1$, with Eq. (5) referred to as "the Heaps law." It has
subsequently been found that Heaps' law emerges naturally
in systems that can be described as sampling from an underlying
Zipf distribution. In an information theoretic formulation
of the abstract concept of word cost, B. Mandelbrot predicted
the relation $b = 1/\zeta$ in 1961 [34], where $\zeta$ is the scaling
exponent corresponding to $\alpha_+$, as in Eqs. (2) and (4). This
prediction is limited to relatively small texts where the unlimited
lexicon, which manifests in the $\alpha_-$ regime, does not play
a significant role. A mathematical extension of this result for
general underlying rank-distributions is also provided by Karlin [35]
using an infinite urn scheme, and extended to broader
classes of heavy-tailed distributions recently by Gnedin et al. [36].
Recent research efforts using stochastic master equation
techniques to model the growth of a book have also predicted
this intrinsic relation between Zipf's law and Heaps'
law [13, 37, 38].

FIG. 5: Literary productivity and vocabulary size in the Google Inc. 1-gram dataset over the past two centuries. (A) Total size of the different
corpora $N_u(t|U_c)$ over time, calculated by using words that satisfy $u_i(t) > U_c = 16$ to eliminate extremely rare 1-grams. (B) Size of the
written vocabulary $N_w(t|U_c)$ over time, calculated under the same conditions as (A).

FIG. 6: Non-stationarity in the characteristic growth fluctuation of
word use. The standard deviation $\sigma_r(t|f_c)$ of the logarithmic growth
rate $r_i(t)$ is presented for all examined corpora. There is an overall
decreasing trend arising from the increasing size of the corpora,
as depicted in Fig. 5(A). On the other hand, the steady production
of new words, as depicted in Fig. 5(B), counteracts this effect. We
calculate $\sigma_r(t|f_c)$ using the relatively common words that meet the
criterion that their average word use $\langle f_i \rangle$ over the entire word history
$T_i$ (using words with lifetime $T_i > 10$ years) is larger than a
threshold $f_c = 1/\mathrm{Min}[N_u(t)]$ (see Table I).

Figure 2 confirms a sub-linear scaling ($b < 1$) between $N_u$
and $N_w$ for each corpus analyzed. These results show how
the marginal returns of new words are given by

$\frac{dN_u}{dN_w} \sim (N_w)^{(1-b)/b}$,   (6)

which is an increasing function of $N_w$ for $b < 1$. Thus, the
relative increase in the induced volume of written languages
is larger for new words than for old words. This is likely due
to the fact that new words are typically technical in nature, requiring
additional explanations that put the word into context
with pre-existing words. Specifically, a new word requires
the additional use of preexisting words as a result of both (i)
the explanation of the content of the new word using existing
technical terms, and (ii) the grammatical infrastructure necessary
for that explanation. Hence, there are large spillovers
in the size of the written corpus that follow from the intricate
dependency structure of language stemming from the various
grammatical roles [39, 40].
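
The Heaps exponent $b$ in the relation $N_w \sim (N_u)^b$ can be estimated by ordinary least squares on the logarithms of yearly $(N_u, N_w)$ pairs, and inverting that relation gives the marginal return $dN_u/dN_w \sim (N_w)^{(1-b)/b}$. The following sketch illustrates both steps; the function names and the synthetic yearly series are assumptions introduced for this example only.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def heaps_exponent(n_u_series, n_w_series):
    """Estimate b in N_w ~ (N_u)^b from yearly corpus and vocabulary sizes."""
    log_n_u = [math.log(n) for n in n_u_series]
    log_n_w = [math.log(n) for n in n_w_series]
    return ols_slope(log_n_u, log_n_w)


def marginal_return_exponent(b):
    """Exponent of N_w in dN_u/dN_w ~ (N_w)^((1 - b) / b)."""
    return (1.0 - b) / b


# Synthetic yearly sizes obeying N_w = 3 * N_u^0.5 (hypothetical data).
n_u_series = [10 ** k for k in range(6, 12)]
n_w_series = [3.0 * n ** 0.5 for n in n_u_series]

b = heaps_exponent(n_u_series, n_w_series)
print(round(b, 6), round(marginal_return_exponent(b), 6))
```

For $b = 1/2$ the marginal return exponent equals one, so the number of additional word uses associated with each new word grows in proportion to the vocabulary size itself.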



In order to investigate the role of rare and new words, we
calculate $N_u$ and $N_w$ using only words that have appeared at
least $U_c$ times. We select the absolute number of uses as a
word use threshold because a word in a given year cannot
appear with a frequency less than $1/N_u$, hence any criterion
using relative frequency would necessarily introduce a bias
for small corpora samples. This choice also eliminates words
that can spuriously arise from Optical Character Recognition
(OCR) errors in the digitization process and also from intrinsic
spelling errors and orthographic spelling variations.
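
The pruning step can be sketched as a simple filter on the yearly counts. In the Python example below, the function name `prune`, the toy counts (including the OCR-like misspelling "tlie") and the choice of the inclusive condition "at least $U_c$ uses" are assumptions made for illustration; the figure captions state the condition with a strict inequality, so the exact boundary convention should be treated as unverified.

```python
def prune(counts, u_c):
    """Return (N_u, N_w) computed only from words used at least u_c times.

    counts : dict word -> number of uses in a given year, u_i(t)
    u_c    : absolute word-use threshold U_c
    """
    kept = [c for c in counts.values() if c >= u_c]
    return sum(kept), len(kept)


# Toy yearly counts (hypothetical data).
counts = {"the": 120, "language": 17, "zipf": 4, "tlie": 1}

# Exponentially spaced thresholds U_c = 2^n.
for n in (0, 4, 7):
    print(2 ** n, prune(counts, 2 ** n))
```

Repeating this for every year and every threshold yields the pruned series $N_u(t|U_c)$ and $N_w(t|U_c)$ from which a threshold-dependent Heaps exponent $b(U_c)$ can be fitted.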

Figures 3 and 4 show the dependence of $N_u$ and
$N_w$ on the exclusion of low-frequency words using a variable
cutoff $U_c = 2^n$ with $n = 0, \ldots, 11$. As $U_c$ increases, the Heaps
scaling exponent increases from $b \approx 0.5$, approaching $b \approx 1$,
indicating that core words are structurally integrated into language
as a proportional background. Interestingly, Altmann et
al. [41] recently showed that "word niche" can be an essential
factor in modeling word use dynamics. New niche words,
though they are marginal increases to a language's lexicon,
are themselves anything but "marginal": they are core words
within a subset of the language. This is particularly the case in
online communities in which individuals strive to distinguish
themselves on short timescales by developing stylistic jargon,
highlighting how language patterns can be context dependent.

We now return to the relation between Heaps' law and
Zipf's law. Table I summarizes the $b$ values calculated by
means of ordinary least squares regression using $U_c = 0$ to
relate $N_u(t)$ to $N_w(t)$. For $U_c = 1$ we find that $b \approx 0.5$
for all languages analyzed, as expected from Heaps law, but
for $U_c > 8$ the $b$ value significantly deviates from 0.5, and
for $U_c > 1000$ the $b$ value begins to saturate, approaching
unity. Considering that $\alpha_+ \approx 2$ implies $\zeta \approx 1$ for all
corpora, Figures 3 and 4 show that we can confirm the
relation $b(U_c) \approx 1/\zeta$ only for the more pruned corpora
that require relatively large $U_c$. This hidden feature of
the scaling relation highlights the underlying structure of
language, which forms a dependency network between the
common words of the kernel lexicon and their more esoteric
counterparts in the unlimited lexicon. Moreover, the function
$dN_w/dN_u \sim (N_u)^{b-1}$ is a monotonically decreasing
function for $b < 1$, demonstrating the decreasing marginal
need for additional words as a corpus grows. In other words,
since we get more and more "mileage" out of new words in
an already large language, additional words are needed less
and less.

FIG. 7: Growth fluctuations of word use scale with the size of the corpora. (A) Depicted is the quantitative relation in Eq. (8) between $\sigma_r(t|f_c)$
and the corpus size $N_u(t|f_c)$. We calculate $\sigma_r(t|f_c)$ using the relatively common words that meet the criterion that their average word use
$\langle f_i \rangle$ over the entire word history (using words with lifetime $T_i > 10$ years) is larger than a threshold $f_c = 10/\mathrm{Min}[N_u(t)]$ (see Table I).
We show the language-dependent scaling value $\beta \approx 0.08$-$0.35$ in each panel. For each language we show the value of the ordinary least
squares best-fit $\beta$ value with the standard error in parentheses. (B) Summary of $\beta(U_c)$ exponents calculated using a use-threshold $U_c$, instead
of a frequency threshold $f_c$ as used in (A). Error bars indicate the standard error in the OLS regression. We perform this additional analysis in
order to provide alternative insight into the role of extremely rare words. For increasing $U_c$ the $\beta(U_c)$ value for each corpus increases from
$\beta \approx 0.05$ to $\beta < 0.25$. This language pruning method quantifies the role of new rare words (also including OCR errors, spelling and other
orthographic variants), which are the significant components of language volatility.



Corpora size and word-use fluctuations. Lastly, it is instructive
to examine how vocabulary size $N_w$ and the overall
size of the corpora $N_u$ affect fluctuations in word use. Figure 5
shows how $N_w(t)$ and $N_u(t)$ vary over time over the
past two centuries. Note that, apart from the periods during
the two World Wars, the number of words printed, which we
will refer to as the "literary productivity," has been increasing
over time. The number of distinct words (vocabulary size)
has also increased, reflecting basic social and technological
advancement [8].

To investigate the role of fluctuations, we focus on the logarithmic
growth rate, commonly used in finance and economics,

$r_i(t) \equiv \ln f_i(t + \Delta t) - \ln f_i(t) = \ln \left[ \frac{f_i(t + \Delta t)}{f_i(t)} \right]$,   (7)

to measure the relative growth of word use over 1-year periods,
$\Delta t \equiv 1$ year. Recent quantitative analysis of the distribution
$P(r)$ of word use growth rates $r_i(t)$ indicates that annual
fluctuations in word use deviate significantly from the predictions
of null models for language evolution [9].
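
For a single word, the logarithmic growth rate is computed from consecutive yearly values of its relative use. The short sketch below assumes a list of yearly frequencies $f_i(t)$ ordered in time with a 1-year step; the function name `log_growth_rates` and the convention of returning `None` when either year has zero frequency are illustrative choices, not part of the original analysis.

```python
import math


def log_growth_rates(f_series):
    """Return r_i(t) = ln f_i(t + 1) - ln f_i(t) for consecutive years.

    Entries are None where the rate is undefined because the word
    has zero relative use in one of the two years.
    """
    rates = []
    for f_now, f_next in zip(f_series, f_series[1:]):
        if f_now > 0 and f_next > 0:
            rates.append(math.log(f_next / f_now))
        else:
            rates.append(None)
    return rates


# Relative use of one word over four years (hypothetical data):
# it doubles, halves again, then disappears.
r = log_growth_rates([1e-6, 2e-6, 1e-6, 0.0])
print(r)
```

A doubling and a subsequent halving give rates of equal magnitude and opposite sign, which is the symmetry that makes the logarithmic rate convenient for measuring fluctuations.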

We define an aggregate fluctuation scale, $\sigma_r(t|f_c)$, using
a frequency cutoff $f_c \propto 1/\mathrm{Min}[N_u(t)]$ to eliminate
infrequently used words. The quantity $\mathrm{Min}[N_u(t)]$ is the
minimum corpus size over the period of analysis, and so
$1/\mathrm{Min}[N_u(t)]$ is an upper bound for the minimum observed
frequency for words in the corpora. Figure 6 shows $\sigma_r(t|f_c)$,
the standard deviation of $r_i(t)$ calculated across all words
that satisfy the condition $\langle f_i \rangle > f_c$ for words with lifetime
$T_i > 10$ years, using $f_c = 1/\mathrm{Min}[N_u(t)]$. Visual inspection
suggests a general decrease in $\sigma_r(t|f_c)$ over time, marked by
sudden increases during times of political conflict. Hence, the
persistent increase in the volume of written language is correlated
with a persistent downward trend in what could be thought
of as the "system temperature" $\sigma_r(t|f_c)$: as a language grows
and matures it also "cools off."

Since this cooling pattern could arise as a simple artifact of
an independent identically distributed (i.i.d.) sampling from an
increasingly large dataset, we test the scaling of $\sigma_r(t|f_c)$ with
corpus size. Figure 7(A) shows that for large $N_u(t)$, each
language is characterized by a scaling relation

$\sigma_r(t|f_c) \sim N_u(t|f_c)^{-\beta}$   (8)

with language-dependent scaling exponent $\beta \approx 0.08$-$0.35$.
We use $f_c = 10/\mathrm{Min}[N_u(t)]$, which defines the frequency
threshold for the inclusion of a given word in our analysis.
There are two candidate null models which give insight into
the limiting behavior of $\beta$. The Gibrat proportional growth
model predicts $\beta = 0$ and the Yule-Simon urn model predicts
$\beta = 1/2$ [42]. We observe $\beta < 1/2$, which indicates
that the fluctuation scale decreases more slowly with increasing
corpus size than would be expected from the Yule-Simon
urn model prediction, deducible via the "delta method" for
determining the approximate scaling of a distribution and its
standard deviation $\sigma$ [43].
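
The two quantities in this scaling test can be sketched as follows: the fluctuation scale is the standard deviation of the growth rates across words in a given year, and $\beta$ is minus the least-squares slope of $\ln \sigma_r$ against $\ln N_u$. The function names, the use of the population standard deviation, and the synthetic series are assumptions made for this illustration, and the word-selection thresholds described in the text are omitted.

```python
import math
import statistics


def fluctuation_scale(rates):
    """Standard deviation of the growth rates r_i(t) across words in one year."""
    return statistics.pstdev(rates)


def size_variance_exponent(corpus_sizes, sigmas):
    """Estimate beta in sigma_r ~ N_u^(-beta) by OLS on logarithms."""
    xs = [math.log(n) for n in corpus_sizes]
    ys = [math.log(s) for s in sigmas]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# Growth rates of four words in one year (hypothetical data).
sigma_one_year = fluctuation_scale([0.1, -0.1, 0.3, -0.3])

# Synthetic yearly series obeying sigma_r = 2 * N_u^(-0.2) (hypothetical data).
corpus_sizes = [1e7, 1e8, 1e9, 1e10]
sigmas = [2.0 * n ** -0.2 for n in corpus_sizes]
beta = size_variance_exponent(corpus_sizes, sigmas)
print(round(sigma_one_year, 4), round(beta, 6))
```

A fitted value between 0 and 1/2 would place a corpus between the two null-model predictions discussed above.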

To further compare the roles of the kernel lexicon versus the
unlimited lexicon, we apply our pruning method to quantify
the dependence of the scaling exponent $\beta$ on the fluctuations
arising from rare words. We omit words from our calculation
of $\sigma_r(t|U_c)$ if their use $u_i(t)$ in year $t$ falls below the
word-use threshold $U_c$. Fig. 7(B) shows that $\beta(U_c)$ increases
from values close to 0 to values less than 1/2 as $U_c$ increases
exponentially. An increasing $\beta(U_c)$ confirms our conjecture
that rare words are largely responsible for the fluctuations in a
language. However, because of the dependency structure between
words, there are residual fluctuation spillovers into the
kernel lexicon, likely accounting for the fact that $\beta < 1/2$ even
when the fluctuations from the unlimited lexicon are removed.

A size-variance relation showing that larger entities have
smaller characteristic fluctuations was also demonstrated at
the scale of individual words using the same Google n-gram
dataset [9]. Moreover, this size-variance relation is strikingly
analogous to the decreasing growth rate volatility observed as
complex economic entities (i.e. firms or countries) increase in
size [42, 44-48], which strengthens the analogy of language
as a complex ecosystem of words governed by competitive
forces.

A further possible explanation for $\beta < 1/2$ is that language
growth is counteracted by the influx of new words, which tend
to have growth-spurts around 30-50 years following their birth 
in the written corpora 0. Moreover, the fluctuation scale 
$\sigma_r(t|f_c)$ is positively influenced by adverse conditions such
as wars and revolutions, since a decrease in $N_u(t)$ may decrease
the competitive advantage that old words have over new
words, allowing new words to break through. The globalization
effect, manifesting from increased human mobility during
periods of conflict, is also responsible for the emergence
of new words within a language.



Discussion 

A coevolutionary description of language and culture requires
many factors and much consideration [49, 50]. While
scientific and technological advances are largely responsible
for written language growth as well as the birth of many new
words [9], socio-political factors also play a strong role. For
example, the sexual revolution of the 1960s triggered the sud- 
den emergence of the words "girlfriend" and "boyfriend" in 
the English corpora U], illustrating the evolving culture of ro- 
mantic courting. Such technological and socio-political per- 
turbations require case-by-case analysis for any deeper under- 
standing, as demonstrated comprehensively by Michel et al. 

Here we analyzed the macroscopic properties of written 
language using the Google Books database Q). We find that 
the word frequency distribution $P(f)$ is characterized by two
scaling regimes. While frequently used words that constitute
the kernel lexicon follow the Zipf law, the distribution has a
less-steep scaling regime quantifying the rarer words constituting
the unlimited lexicon. Our result is robust across languages
as well as across other data subsets, thus extending
the validity of the seminal observation by Ferrer i Cancho and
Solé [14], who first reported it for a large body of English
text. The kink in the slope preceding the entry into the unlimited
lexicon is a likely consequence of the limits of human
mental ability that force the individual to optimize the usage
of frequently used words and forget specialized words that
are seldom used. This hypothesis agrees with the "principle
of least effort" that minimizes communication noise between
speakers (writers) and listeners (readers), which in turn may
lead to the emergence of the Zipf law [16].

Using extremely large written corpora that document
the profound expansion of language over centuries, we analyzed
the dependence of vocabulary growth on corpus growth
and validated the Heaps law scaling relation given by Eq. (5).
Furthermore, we systematically pruned the corpora data using
a word occurrence threshold $U_c$ and compared the resulting
$b(U_c)$ value to the $\zeta \approx 1$ value, which is stable since it is derived
from the "kernel" lexicon. We conditionally confirm the
theoretical prediction $\zeta \approx 1/b$ [13, 34-38], which we validate
only in the case that the extremely rare "unlimited" lexicon
words are not included in the data sample (see Figs. 3 and 4).

The economies of scale ($b < 1$) indicate that there is an
increasing marginal return for new words, or alternatively, a
decreasing marginal need for new words, as evidenced by allometric
scaling. This can intuitively be understood in terms
of the increasing complexities and combinations of words that
become available as more words are added to a language, lessening
the need for lexical expansion. However, a relationship
between new words and existing words is retained. Every introduction
of a word, from an informal setting (e.g. an expository
text) to a formal setting (e.g. a dictionary), is yet another
chance for the more common describing words to play
out their respective frequencies, underscoring the hierarchy
of words. This can be demonstrated quite instructively from
Eq. (6), which implies for $b = 1/2$ that $dN_u/dN_w \propto N_w$,
meaning that it requires a quantity proportional to the vocabulary
size $N_w$ to introduce a new word, or alternatively, that a
quantity proportional to $N_w$ necessarily results from the addition.

Though new words are needed less and less, the expansion
of language continues, doing so with marked characteristics.
Taking the growth rate fluctuations of word use to be a kind
of temperature, we note that, like an ideal gas, most languages
"cool" when they expand. Since the relationship between
the temperature and corpus volume is a power law, one
may, loosely speaking, liken language growth to the expansion
of a gas or the growth of a company [42, 44-48]. In contrast
to the static laws of Zipf and Heaps, we note that this finding
is of a dynamical nature.
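The "temperature" referred to here can be illustrated with a small computation. The sketch below is an assumption-laden toy, not the paper's pipeline: the function name `growth_fluctuation` is hypothetical, the growth rate of a word is taken as the logarithmic change in its relative frequency between two consecutive years, and the fluctuation is the population standard deviation of those rates over words present in both years.

```python
import math
import statistics


def growth_fluctuation(counts_t, counts_next):
    """Std. dev. of log growth rates of relative word frequency between two years."""
    total_t = sum(counts_t.values())
    total_next = sum(counts_next.values())
    rates = []
    for word, count in counts_t.items():
        next_count = counts_next.get(word, 0)
        if count > 0 and next_count > 0:  # skip words absent in either year
            f_t = count / total_t
            f_next = next_count / total_next
            rates.append(math.log(f_next / f_t))
    # pstdev raises StatisticsError if no word appears in both years.
    return statistics.pstdev(rates)


# Toy example with made-up counts: one word triples in use, the other is flat.
sigma = growth_fluctuation({"a": 1, "b": 1}, {"a": 3, "b": 1})
```

Repeating this for each pair of consecutive years and regressing log(sigma) on log(corpus size) would give a slope whose negative is the size-variance exponent, under the assumption of a power-law relation.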

Other aspects of language growth may also be understood
in terms of the expansion of a gas. Since larger literary
productivity imposes a downward trend on growth rate fluctuations
(which also implies that the ranking of the top words and
phrases becomes more stable [51]), productivity itself can be
thought of as a kind of inverse pressure, in that highly
productive years are observed to "cool" a language off. Also, it
is during the "high-pressure" low-productivity years that new
words tend to emerge more frequently.

Interestingly, the appearance of new words is more like gas
condensation, tending to cancel the cooling brought on by
language expansion. These two effects, corpus expansion and
new word "condensation," therefore act against each other.
Across all corpora we calculate a size-variance scaling
exponent $0 < \beta < 1/2$, bounded by the prediction of $\beta = 0$
(Gibrat growth model) and $\beta = 1/2$ (Yule-Simon growth
model).

In the context of allometric relations, Bettencourt et al. [27]
note that the scaling relations describing the dynamics of cities
show an increase in the characteristic pace of life as the
system size grows, whereas those found in biological systems
show a decrease in characteristic rates as the system size grows.



Since the languages we analyzed tend to "cool" as they
expand, there may be deep-rooted parallels with biological
systems based on principles of efficiency [16]. Languages, like
biological systems, demonstrate economies of scale (b < 1)
manifesting from a complex dependency structure that
mimics a hierarchical "circulatory system" required by the
organization of language [39, 52-56] and the limits of the efficiency
of the speakers/writers who exchange the words [19, 41, 57].



[1] Google Books Ngram Viewer,
http://books.google.com/ngrams

[2] Evans, J. A. and Foster, J. G. Metaknowledge. Science 331,
721-725 (2011).

[3] Ball, P. Why Society is a Complex Matter. (Springer-Verlag,
Berlin, 2012). 

[4] Helbing, D., Balietti, S. How to Create an Innovation Acceler- 
ator. Eur. Phys. J. Special Topics 195, 101-136 (2011). 

[5] Lazer, D., et al. Computational social science. Science 323, 
721-723 (2009). 

[6] Barabasi, A. L. The network takeover. Nature Physics 8, 14-16 
(2012). 

[7] Vespignani, A. Modeling dynamical processes in complex
socio-technical systems. Nature Physics 8, 32-39 (2012).

[8] Michel, J.-B., et al. Quantitative analysis of culture using
millions of digitized books. Science 331, 176-182 (2011).

[9] Petersen, A. M., Tenenbaum, J., Havlin, S., and Stanley, H. E.
Statistical laws governing fluctuations in word use from word
birth to word death. Scientific Reports 2, 313 (2012).

[10] Gao, J., Hu, J., Mao, X., and Perc, M. Culturomics meets
random fractal theory: Insights into long-range correlations of
social and natural phenomena over the past two centuries. J. R.
Soc. Interface 9, 1956-1964 (2012).

[11] Zipf, G. K. Human Behavior and the Principle of Least-Effort: 
An Introduction to Human Ecology. Addison-Wesley, Cam- 
bridge, MA, (1949). 

[12] Tsonis, A. A., Schultz, C., and Tsonis, P. A. Zipf's law and
the structure and evolution of languages. Complexity 3, 12-13 
(1997). 

[13] Serrano, M. A., Flammini, A., and Menczer, F. Modeling sta- 
tistical properties of written text. PLoS ONE 4, e5372 (2009). 

[14] Ferrer i Cancho, R. and Sole, R. V. Two regimes in the fre- 
quency of words and the origin of complex lexicons: Zipf's 
law revisited. Journal of Quantitative Linguistics 8, 165-173 
(2001). 

[15] Ferrer i Cancho, R. The variation of Zipf's law in human lan- 
guage. Eur. Phys. J. B 44, 249-257 (2005). 

[16] Ferrer i Cancho, R. and Sole, R. V. Least effort and the origins 
of scaling in human language. Proc. Natl. Acad. Sci. USA 100, 
788-791 (2003). 

[17] Baek, S. K., Bernhardsson, S., and Minnhagen, P. Zipf's law 
unzipped. New J. Phys. 13, 043004 (2011). 

[18] Heaps, H. S. Information Retrieval: Computational and Theo- 
retical Aspects. (Academic Press, New York, 1978). 

[19] Bernhardsson, S., Correa da Rocha, L. E., and Minnhagen, P. 
The meta book and size-dependent properties of written lan- 
guage. New J. Phys. 11, 123015 (2009). 

[20] Bernhardsson, S., Correa da Rocha, L. E., and Minnhagen, P. 
Size-dependent word frequencies and translational invariance 
of books. Physica A 389, 330-341 (2010). 

[21] Kleiber, M. Body size and metabolism. Hilgardia 6, 315-351
(1932).

[22] West, G. B. Allometric scaling of metabolic rate from 
molecules and mitochondria to cells and mammals. Proc. Natl. 
Acad. Sci. USA 98, 2473-2478 (2002). 

[23] Makse, H. A., Havlin, S., and Stanley, H. E. Modelling urban 
growth patterns. Nature 377, 608-612 (1995). 

[24] Makse, H. A., Andrade Jr., J. S., Batty, M., Havlin, S., and Stanley,
H. E. Modeling urban growth patterns with correlated percola- 
tion. Phys. Rev. E 58, 7054-7062 (1998). 

[25] Rozenfeld, H. D., Rybski, D., Andrade Jr., J. S., Batty, M., Stan- 
ley, H. E., and Makse, H. A. Laws of population growth. Proc. 
Natl. Acad. Sci. USA 105, 18702-18707 (2008).

[26] Gabaix, X. Zipf's law for cities: An explanation. Quarterly 
Journal of Economics 114, 739-767 (1999). 

[27] Bettencourt, L. M. A., Lobo, J., Helbing, D., Kühnert, C., and
West, G. B. Growth, innovation, scaling, and the pace of life in 
cities. Proc. Natl. Acad. Sci. USA 104, 7301-7306 (2007). 

[28] Batty, M. The size, scale, and shape of cities. Science 319, 
769-771 (2008). 

[29] Rozenfeld, H. D., Rybski, D., Gabaix, X., and Makse, H. A. 
The area and population of cities: New insights from a different 
perspective on cities. American Economic Review 101, 2205- 
2225 (2011). 

[30] Newman, M. E. J. Power laws, Pareto distributions and Zipf's 
law. Contemporary Phys. 46, 323-351 (2005). 

[31] Stanley, M. H. R., Buldyrev, S. V., Havlin, S., Mantegna, R., 
Salinger, M., and Stanley, H. E. Zipf plots and the size distri- 
bution of firms. Econ. Lett. 49, 453-457 (1995). 

[32] Mantegna, R. N., et al. Systematic analysis of coding and non- 
coding DNA sequences using methods of statistical linguistics. 
Phys. Rev. E 52, 2939-2950 (1995). 

[33] Clauset, A., Shalizi, C. R., and Newman, M. E. J. Power-law 
distributions in empirical data. SIAM Rev. 51, 661-703 (2009). 

[34] Mandelbrot, B. On the theory of word frequencies and on re- 
lated Markovian models of discourse, in: R. Jakobson, Struc- 
ture of Language and its Mathematical Aspects. Proceedings of 
Symposia in Applied Mathematics Vol. XII, 190-219 (1961). 

[35] Karlin, S. Central limit theorems for certain infinite urn 
schemes. Journal of Mathematics and Mechanics 17, 373-401
(1967). 

[36] Gnedin, A., Hansen, B., Pitman, J. Notes on the occupancy 
problem with infinitely many boxes: general asymptotics and 
power laws. Probability Surveys 4, 146-171 (2007). 

[37] van Leijenhorst, D. C., van der Weide, Th. P. A formal
derivation of Heaps' Law. Inform. Sci. 170, 263-272 (2005).

[38] Lü, L., Zhang, Z.-K., Zhou, T. Zipf's law leads to Heaps' law:
Analyzing their relation in finite-size systems. PLoS ONE 5,
e14139 (2010).

[39] Steyvers, M. and Tenenbaum, J. B. The large-scale structure of 
semantic networks: Statistical analyses and a model of semantic 
growth. Cogn. Sci. 29, 41-78 (2005). 






[40] Markosova, M. Network model of human language. Physica A
387, 661-666 (2008).

[41] Altmann, E. G., Pierrehumbert, J. B., and Motter, A. E. Niche
as a determinant of word fate in online groups. PLoS ONE 6,
e19009 (2011).

[42] Riccaboni, M., Pammolli, F., Buldyrev, S. V., Ponta, L., and 
Stanley, H. E. The size variance relationship of business firm 
growth rates. Proc. Natl. Acad. Sci. USA 105, 19595-19600
(2008).

[43] Oehlert, G. W. A Note on the Delta Method. The American 
Statistician 46, 27-29 (1992). 

[44] Amaral, L. A. N., et al. Scaling Behavior in Economics: I. Em- 
pirical Results for Company Growth. J. Phys. I France 7, 621- 
633 (1997). 

[45] Amaral, L. A. N., et al. Power Law Scaling for a System of 
Interacting Units with Complex Internal Structure. Phys. Rev. 
Lett. 80, 1385-1388 (1998). 

[46] Fu, D., Pammolli, F., Buldyrev, S. V., Riccaboni, M., Matia,
K., Yamasaki, K., and Stanley, H. E. The growth of business 
firms: Theoretical framework and empirical evidence. Proc. 
Natl. Acad. Sci. USA 102, 18801-18806 (2005). 

[47] Podobnik, B., Horvatic, D., Petersen, A. M., Stanley, H. E. 
Quantitative relations between risk, return, and firm size. EPL 
85, 50003 (2009). 

[48] Podobnik, B., Horvatic, D., Petersen, A. M., Njavro, M.,
Stanley, H. E. Common scaling behavior in finance and
macroeconomics. Eur. Phys. J. B 76, 487-490 (2010).

[49] Mufwene, S. The Ecology of Language Evolution. (Cambridge
Univ. Press, Cambridge, UK, 2001).

[50] Mufwene, S. Language Evolution: Contact, Competition and
Change. (Continuum International Publishing Group, New
York, NY, 2008).

[51] Perc, M. Evolution of the most common English words and 
phrases over the centuries. J. R. Soc. Interface 9, 3323-3328 
(2012). 

[52] Sigman, M. and Cecchi, G. A. Global organization of the word- 
net lexicon. Proc. Natl. Acad. Sci. USA 99, 1742-1747 (2002). 



[53] Alvarez-Lacalle, E., Dorow, B., Eckmann, J. -P., and Moses, 
E. Hierarchical structures induce long-range dynamical corre- 
lations in written texts. Proc. Natl. Acad. Sci. USA 103, 7956- 
7961 (2006). 

[54] Altmann, E. G., Cristadoro, G., and Esposti, M. D. On the
origin of long-range correlations in texts. Proc. Natl. Acad. Sci. 
USA 109, 11582-11587 (2012). 

[55] Montemurro, M. A. and Pury, P. A. Long-range fractal correla- 
tions in literary corpora. Fractals 10, 451-461 (2002). 

[56] Corral, A., Ferrer i Cancho, R., and Díaz-Guilera, A.
Universal complex structures in written language. arXiv:0901.2924
(2009).

[57] Altmann, E. G., Pierrehumbert, J. B., and Motter, A. E. Be- 
yond word frequency: bursts, lulls, and scaling in the temporal 
distributions of words. PLoS ONE 4, e7678 (2009). 



Acknowledgments 

AMP acknowledges support from the IMT Lucca Foundation. 
JT, SH and HES acknowledge support from the DTRA, ONR, 
the European EPIWORK and LINC projects, and the Israel 
Science Foundation. MP acknowledges support from the 
Slovenian Research Agency. 



Author Contributions 

A. M. P., J. T., S. H., H. E. S., and M. P. designed
research, performed research, wrote, reviewed and approved the
manuscript. A. M. P. performed the numerical and statistical
analysis of the data.



